MeerKAT Observations for ML¶

  • visibilities: raw and calibrated; ~1 TB per observation. Many observations of a single target can be combined.
  • dirty images: the inverse Fourier transform of the gridded visibilities, F^{-1}(uv), produced as the 0th CLEAN iteration; not a common data product
  • CLEAN images: either a fixed number of CLEAN iterations, or CLEANing until a desired sensitivity (rms) is reached. Usually created by a pipeline; a very common data product.
  • "enhanced" images: primary-beam correction, source peeling, etc.; additional processing beyond the default pipeline has been applied
  • continuum images and spectral image cubes (also polarization components)
  • FITS source catalogs: usually produced by PyBDSF or a similar source-finding algorithm
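The dirty-image relation above, F^{-1}(uv), can be illustrated with a toy example. The gridding here (scattering complex samples onto a uv grid) is a hypothetical stand-in for a real imager's convolutional gridding and weighting:

```python
import numpy as np

# Toy uv-grid: a few "visibility" samples placed on a 256x256 grid.
# A real imager grids visibilities with a convolution kernel and
# applies weighting (natural/uniform/robust); here we just scatter
# complex values to illustrate the F^{-1}(uv) step.
rng = np.random.default_rng(0)
n = 256
uv_grid = np.zeros((n, n), dtype=complex)
rows = rng.integers(0, n, size=500)
cols = rng.integers(0, n, size=500)
uv_grid[rows, cols] = rng.normal(size=500) + 1j * rng.normal(size=500)

# Dirty image: inverse 2D FFT of the gridded visibilities.
dirty = np.fft.fftshift(np.fft.ifft2(np.fft.ifftshift(uv_grid))).real
print(dirty.shape)  # (256, 256)
```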

Current focus: CLEAN and dirty images for commonly observed targets

  • MIGHTEE survey of COSMOS and XMM fields
    • deep observations, relatively small sky area
    • visibilities, CLEAN, and enhanced images are publicly available
    • source catalogs to make simulations possible
  • MGCLS survey (galactic clusters)
    • semi-deep observations, ~100 clusters
    • CLEAN and enhanced images publicly available (continuum and 14-frequency spectral cube)
    • source catalogs to make simulations possible
    • presence of extended sources

Example CLEAN Images from MIGHTEE and MGCLS¶

Full data product¶

Cropped images for ML (256x256)¶

MIGHTEE enhanced images¶

1300 crops, taken every 200 px (in x and y), keeping only crops with < 30% NaN pixels
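The cropping scheme (256x256 windows every 200 px, rejecting crops with too many NaNs) might be sketched as below; the 30% threshold follows the text, while the toy image and function name are illustrative:

```python
import numpy as np

def extract_crops(image, size=256, stride=200, max_nan_frac=0.30):
    """Slide a size x size window across the image every `stride` px,
    keeping only crops whose NaN fraction is below `max_nan_frac`."""
    crops = []
    for y in range(0, image.shape[0] - size + 1, stride):
        for x in range(0, image.shape[1] - size + 1, stride):
            crop = image[y:y + size, x:x + size]
            if np.isnan(crop).mean() < max_nan_frac:
                crops.append(crop)
    return crops

# Toy image with a NaN region, mimicking a primary-beam-blanked mosaic edge.
img = np.random.default_rng(1).normal(size=(1024, 1024))
img[:300, :] = np.nan
crops = extract_crops(img)
print(len(crops))  # crops overlapping the NaN region are rejected
```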

MGCLS basic and enhanced images¶

The beam size can differ between the basic and enhanced image products, and the enhanced images can also contain coordinate shifts for alignment. Crop coordinates are determined from the enhanced images, and the corresponding crop is taken from the basic image product, with slight adjustments if the resulting image is not 256x256.

23000 crops (paired images)

Pre-trained DINO with MeerKAT images: Vision Transformers¶

Which parts of a scientific image do the various attention heads highlight/segment as important?

Example with random crop from MIGHTEE

Example with random crop from MGCLS

DINO with MeerKAT Images: ResNet architecture¶

Example with random crop from MIGHTEE

Example with random crop from MGCLS

Experiments with fine-tuning¶

Example with random crop from MIGHTEE

Example with random crop from MGCLS

Examining the embeddings¶

ViT embedding size: 384

ResNet50 embedding size: 2048

Visualization with t-SNE¶
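A minimal t-SNE sketch with scikit-learn, using random vectors as hypothetical stand-ins for the ViT/ResNet embeddings:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for N embedding vectors (e.g., 384-d ViT outputs).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 384))

# Project to 2-D for plotting; perplexity must be < n_samples.
xy = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(embeddings)
print(xy.shape)  # (100, 2)
```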

PCA¶

  • 99% of variance explained with 11 components (ViT) and >500 components (ResNet)
  • 99% of variance explained with 373 components (pre-trained) and 33 components (fine-tuned)
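The component counts above come from the cumulative explained-variance ratio. A sketch of how such a count can be obtained (pure-numpy PCA via SVD, with random stand-in embeddings):

```python
import numpy as np

def n_components_for_variance(X, target=0.99):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `target`."""
    Xc = X - X.mean(axis=0)
    # Singular values of the centered data give the PCA variances.
    s = np.linalg.svd(Xc, compute_uv=False)
    var_ratio = s**2 / np.sum(s**2)
    return int(np.searchsorted(np.cumsum(var_ratio), target) + 1)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 384))
print(n_components_for_variance(X))
```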

Downstream tasks: Similarity search with FAISS¶

FAISS query of MIGHTEE on embeddings from the pre-trained ViT

FAISS query of MIGHTEE on embeddings from the pre-trained ResNet

FAISS query of MGCLS on embeddings from the pre-trained ResNet

FAISS query of MGCLS on embeddings from the fine-tuned ResNet

Lessons Learned & Next Steps¶

MGCLS: sensitivity matters!¶

  • Different observations are CLEANed to slightly different sensitivities (source-free rms). Combined with different observing parameters (e.g., time on source, proximity to the horizon), this can produce quite distinct sets of image crops.
  • Normalizing the data to a similar sensitivity may improve results, so that training learns important characteristics other than which image a crop originated from. How should this ideally be done, especially considering the beam shape and associated changes in flux?
  • This dataset is great for (image-plane) experimentation. Stokes and spectral cubes would be interesting to investigate, as would adding the spectral index as another "channel".
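One possible normalization, sketched here as an assumption rather than an established procedure, is to divide each crop by a robust (MAD-based) noise estimate so the source-free rms is comparable across observations; this ignores the beam-shape question raised above:

```python
import numpy as np

def robust_rms(image):
    """Estimate the source-free rms via the median absolute deviation,
    which is insensitive to bright sources in the crop."""
    data = image[np.isfinite(image)]
    mad = np.median(np.abs(data - np.median(data)))
    return 1.4826 * mad  # scale MAD to Gaussian sigma

def normalize_to_rms(image, target_rms=1.0):
    """Scale a crop so its estimated noise level equals `target_rms`."""
    return image * (target_rms / robust_rms(image))

rng = np.random.default_rng(0)
crop = rng.normal(scale=3e-6, size=(256, 256))  # ~3 uJy/beam noise
crop[100:105, 100:105] += 1e-3                  # a bright source
normed = normalize_to_rms(crop)
print(round(robust_rms(normed), 2))  # 1.0
```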

MIGHTEE: need more data¶

  • Difficult to draw any conclusions with only 1300 images; fine-tuning also seems to have very little effect
  • Simulations coming soon! Dirty images too.
  • Would it be possible to use even smaller crops? Some work has been done on very small source-only crops (though not with ViTs); do some more reading

ViT¶

Pre-trained ViTs look promising, given the way the attention heads already seem to select sources. However, fine-tuning with the default set of hyperparameters does not result in the loss decreasing, regardless of dataset size. Is this simply a matter of finding the correct hyperparameters, or is there something else to consider?

DINO specifics¶

  • Current augmentations are: two global crops and 8 local crops, with random horizontal flip
  • Experimented with adding the CLEAN image as an "augmentation" (using solar flare dirty/CLEAN pairs); needs more data to be conclusive (>2500)
  • Gaussian or other type of noise?
  • What other augmentations could make sense? PSF, rotation? Anything radio-specific at these wavelengths?
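The multi-crop scheme above (two global crops, eight local crops, random horizontal flip) can be sketched in plain numpy. The crop sizes here (192 and 96 px from a 256 px image) are illustrative choices, not the values used in training:

```python
import numpy as np

def random_crop(image, size, rng):
    """Random square crop with a 50% chance of a horizontal flip."""
    y = rng.integers(0, image.shape[0] - size + 1)
    x = rng.integers(0, image.shape[1] - size + 1)
    crop = image[y:y + size, x:x + size]
    if rng.random() < 0.5:
        crop = crop[:, ::-1]
    return crop

def multi_crop(image, rng, n_global=2, n_local=8,
               global_size=192, local_size=96):
    """DINO-style views: a few large 'global' crops plus many small
    'local' crops of the same image."""
    views = [random_crop(image, global_size, rng) for _ in range(n_global)]
    views += [random_crop(image, local_size, rng) for _ in range(n_local)]
    return views

rng = np.random.default_rng(0)
views = multi_crop(np.zeros((256, 256)), rng)
print([v.shape for v in views])
```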

Next steps:¶

  • MIGHTEE simulation data
  • Make dirty images from MIGHTEE visibilities
  • How to understand the embedding vectors - for example, does the principal component of the MGCLS-fine-tuned embeddings represent mean flux in a crop?
  • Generate metadata-based labels (e.g., number of sources in a crop, min/max flux, total flux of detected sources) using the catalog information for MIGHTEE and MGCLS
  • get more solar flare dirty/clean pairs to experiment with the CLEAN image as augmentation
  • Figure out how to average pre-trained ResNet weights to operate on single-channel images, or experiment with training from scratch
  • investigate ViT-base
  • investigate CLIP with dirty and clean embeddings